
New York City (NYC), often called New York (NY), is the most populous city in the United States. With an estimated 2019 population of 8,336,817 distributed over about 302.6 square miles (784 km2), New York is also the most densely populated major city in the United States.
Situated on one of the world's largest natural harbors, New York City is composed of five boroughs, each of which is a county of the State of New York. The five boroughs—Brooklyn, Queens, Manhattan, the Bronx, and Staten Island—were consolidated into a single city in 1898. The city and its metropolitan area constitute the premier gateway for legal immigration to the United States. As many as 800 languages are spoken in New York, making it the most linguistically diverse city in the world. New York is home to more than 3.2 million residents born outside the United States, the largest foreign-born population of any city in the world as of 2016. As of 2019, the New York metropolitan area is estimated to produce a gross metropolitan product (GMP) of $2.0 trillion. If the New York metropolitan area were a sovereign state, it would have the eighth-largest economy in the world. New York is home to the highest number of billionaires of any city in the world.
With roughly 8.3 million residents and enormous economic output, New York is an extremely dense city, and that density shapes where shops and social venues locate. Business investors look for lower real estate costs in areas with high population density and comparatively little competition from the type of business they want to open. Information to guide investors in this direction is difficult to obtain today.
Considering these problems, we can create a map and information chart in which a real estate index is overlaid on New York and each district is clustered according to venue density. This would help an investor decide the ideal location to run a business based on the factors mentioned above.
The data requirements include: (a) spatial data of New York to build maps with boundaries; (b) the average sale price per square metre for every borough; and (c) venue data for the neighborhoods. Venues such as coffee shops and entertainment spots, depending on the business requirements, are treated as baseline parameters.
Data Source
As a first step, load the New York geospatial data into a dataframe for map generation.
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
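Later cells reference a `neighborhoods` dataframe (Borough / Neighborhood / Latitude / Longitude) that is never constructed in this excerpt. A minimal sketch of how it is typically built from this JSON, assuming each feature keeps the borough and neighborhood name under `properties` and a `[longitude, latitude]` pair under `geometry` (the layout of the commonly used `newyork_data.json`):

```python
import pandas as pd

def features_to_neighborhoods(features):
    """Flatten GeoJSON-style features into a Borough/Neighborhood/lat/lon table."""
    rows = []
    for feature in features:
        props = feature['properties']
        lon, lat = feature['geometry']['coordinates']  # GeoJSON order: [lon, lat]
        rows.append({'Borough': props['borough'],
                     'Neighborhood': props['name'],
                     'Latitude': lat,
                     'Longitude': lon})
    return pd.DataFrame(rows)

# neighborhoods = features_to_neighborhoods(newyork_data['features'])
```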
In the section below, the New York housing sales data is loaded and cleansed for use in our comparison.
df_housing = pd.read_csv('nyc-rolling-sales.csv')
# drop rows with null values
df_housing = df_housing.dropna()
# check for duplicated rows and drop them
print(sum(df_housing.duplicated()))
df_housing.drop_duplicates(inplace=True)
0
There are too many columns; ultimately we only need BOROUGH, SALE PRICE, YEAR BUILT and TOTAL UNITS. First, let us pull the pricing and square-footage columns into a new dataframe for cleaning.
df_housing_clean = df_housing[['BOROUGH', 'GROSS SQUARE FEET', 'LAND SQUARE FEET', 'SALE PRICE']]
df_housing_clean = df_housing_clean.rename(columns={'GROSS SQUARE FEET': 'gsqf',
                                                    'LAND SQUARE FEET': 'lsqf',
                                                    'SALE PRICE': 'sale_price',
                                                    'BOROUGH': 'borough'})
df_housing_clean["sale_price"] = df_housing_clean["sale_price"].str.strip()
df_housing_clean["gsqf"] = df_housing_clean["gsqf"].str.strip()
df_housing_clean["lsqf"] = df_housing_clean["lsqf"].str.strip()
df_housing_clean = df_housing_clean.query("sale_price != '-' ")
df_housing_clean = df_housing_clean.query("gsqf != '-' ")
df_housing_clean = df_housing_clean.query("lsqf != '-' ")
df_housing_clean['sale_price'] = pd.to_numeric(df_housing_clean['sale_price'], errors='coerce')
df_housing_clean['gsqf'] = pd.to_numeric(df_housing_clean['gsqf'], errors='coerce')
df_housing_clean['lsqf'] = pd.to_numeric(df_housing_clean['lsqf'], errors='coerce')
# restart from the raw frame, keeping only the four columns we actually need
df_housing_clean = df_housing[['BOROUGH', 'SALE PRICE', 'YEAR BUILT', 'TOTAL UNITS']]
df_housing_clean[df_housing_clean['YEAR BUILT']==0]['YEAR BUILT'].count()
6970
df_housing_clean=df_housing_clean[df_housing_clean['YEAR BUILT']!=0].copy()
df_housing_clean[df_housing_clean['TOTAL UNITS']==0]['TOTAL UNITS'].count()
16864
df_housing_clean=df_housing_clean[df_housing_clean['TOTAL UNITS']!=0].copy()
df_housing_clean = df_housing_clean.rename(columns={'SALE PRICE': 'sale_price',
                                                    'BOROUGH': 'borough'})
df_housing_clean.shape
(60714, 4)
# Mapping from number reference to borough:
# Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), Staten Island (5)
state_abbrev = {1: 'Manhattan',
                2: 'Bronx',
                3: 'Brooklyn',
                4: 'Queens',
                5: 'Staten Island'}

# Row-wise function translating the numeric borough code to its name
def abbreviate_borough(housing):
    if housing['borough'] in state_abbrev:
        return state_abbrev[housing['borough']]
    return housing['borough']
df_housing_clean['borough'] = df_housing_clean.apply(abbreviate_borough, axis=1)
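As an aside, the same lookup can be done without a row-wise `apply`. A small sketch using `Series.map` (the sample codes here are illustrative, not from the dataset): `map()` translates the known codes, and `fillna()` lets any value that is already a name pass through, mirroring the `else` branch of `abbreviate_borough`.

```python
import pandas as pd

state_abbrev = {1: 'Manhattan', 2: 'Bronx', 3: 'Brooklyn',
                4: 'Queens', 5: 'Staten Island'}

# illustrative sample: numeric codes plus one value that is already a name
codes = pd.Series([1, 3, 5, 'Queens'])

# vectorized equivalent of the apply() above
named = codes.map(state_abbrev).fillna(codes)
```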
Next step: convert sale_price to numeric.
df_housing_clean["sale_price"] = df_housing_clean["sale_price"].str.strip()
df_housing_clean = df_housing_clean.query("sale_price != '-' ")
df_housing_clean.shape
(49128, 4)
df_housing_clean['sale_price'] = pd.to_numeric(df_housing_clean['sale_price'], errors='coerce')
df_housing_clean[(df_housing_clean['sale_price']<10000) | (df_housing_clean['sale_price']>10000000)]['sale_price'].count() /len(df_housing_clean)
0.2159664549747598
About 21% of the sale prices are either below $10,000 or above $10,000,000; these are likely non-arm's-length transfers or outliers, so we drop all these observations from the data.
df_housing_clean= df_housing_clean[(df_housing_clean['sale_price']>10000) & (df_housing_clean['sale_price']<10000000)].copy()
df_housing_clean.describe()
| | sale_price | YEAR BUILT | TOTAL UNITS |
|---|---|---|---|
| count | 3.845700e+04 | 38457.000000 | 38457.000000 |
| mean | 1.077431e+06 | 1953.056115 | 2.191123 |
| std | 1.270522e+06 | 37.838285 | 17.841976 |
| min | 1.000100e+04 | 1800.000000 | 1.000000 |
| 25% | 4.550000e+05 | 1923.000000 | 1.000000 |
| 50% | 6.880000e+05 | 1941.000000 | 1.000000 |
| 75% | 1.110000e+06 | 1989.000000 | 2.000000 |
| max | 9.999999e+06 | 2017.000000 | 2261.000000 |
df_housing_clean= df_housing_clean[(df_housing_clean['sale_price']<4000000)]
plt.figure(figsize=(12,6))
sns.distplot(df_housing_clean['sale_price'], kde=True, bins=50, rug=True)
plt.show()
plt.figure(figsize=(12,6))
sns.boxplot(y = 'borough', x = 'sale_price', data = df_housing_clean )
plt.title('Box plots for SALE PRICE on each BOROUGH')
plt.show()
# df_bar: average sale price per borough (not defined earlier in this excerpt)
df_bar = df_housing_clean.groupby('borough', as_index=False)['sale_price'].mean()
plt.figure(figsize=(12,6))
sns.barplot(y = 'borough', x = 'sale_price', data = df_bar )
plt.title('Average SALE PRICE on each BOROUGH')
plt.show()
Below, the population, density & GDP data, extracted manually from Wikipedia, is loaded.
df_ny_population = pd.read_csv('nyk_population.csv')
df_ny_population = df_ny_population.sort_values(by='population', ascending=True)
df_ny_population
| | borough | county | population | gdp_billions | gdp_capita | sqms | sqkm | pers_sqms | pers_sqkm |
|---|---|---|---|---|---|---|---|---|---|
| 4 | Staten Island | Richmond | 476143 | 14.514 | 30500 | 58.37 | 151.18 | 8157 | 3150 |
| 0 | Bronx | Bronx | 1418207 | 42.695 | 30100 | 42.10 | 109.04 | 33867 | 13006 |
| 2 | Manhattan | New York | 1628706 | 600.244 | 368500 | 22.83 | 59.13 | 71341 | 27544 |
| 3 | Queens | Queens | 2253858 | 93.310 | 41400 | 108.53 | 281.09 | 20767 | 8018 |
| 1 | Brooklyn | Kings | 2559903 | 91.559 | 35800 | 70.82 | 183.42 | 36147 | 13957 |
The population/stats data is then merged with the average housing sale price.
df_ny_stats = pd.merge(df_ny_population, df_bar)
df_ny_stats
| | borough | county | population | gdp_billions | gdp_capita | sqms | sqkm | pers_sqms | pers_sqkm | sale_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Staten Island | Richmond | 476143 | 14.514 | 30500 | 58.37 | 151.18 | 8157 | 3150 | 5.067751e+05 |
| 1 | Bronx | Bronx | 1418207 | 42.695 | 30100 | 42.10 | 109.04 | 33867 | 13006 | 5.579480e+05 |
| 2 | Manhattan | New York | 1628706 | 600.244 | 368500 | 22.83 | 59.13 | 71341 | 27544 | 1.571737e+06 |
| 3 | Queens | Queens | 2253858 | 93.310 | 41400 | 108.53 | 281.09 | 20767 | 8018 | 7.098848e+05 |
| 4 | Brooklyn | Kings | 2559903 | 91.559 | 35800 | 70.82 | 183.42 | 36147 | 13957 | 1.038700e+06 |
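Note that `pd.merge` is called without an `on=` argument, so it joins on every column the two frames share; here the lone shared column is `borough`. A tiny illustration with made-up numbers:

```python
import pandas as pd

# made-up data: 'borough' is the only shared column, so it becomes the join key
pop = pd.DataFrame({'borough': ['Queens', 'Bronx'],
                    'population': [2253858, 1418207]})
price = pd.DataFrame({'borough': ['Bronx', 'Queens'],
                      'sale_price': [557948.0, 709884.8]})

stats = pd.merge(pop, price)  # default: inner join on the shared 'borough' column
```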
fig, ax = plt.subplots(figsize=(12, 12), nrows=2, ncols=2)
#first graph - bar plot highlighting avg sale price borough level
g1 = ax[0, 0].bar(data=df_bar, x='borough', height='sale_price');
#second graph pie chart for population
data = df_ny_stats['population']
label = df_ny_stats['borough']
g2= ax[0, 1].pie(data, autopct='%1.1f%%', shadow=True, startangle=90, labels=label);
#third graph pie chart for gdp per capita
data = df_ny_stats['gdp_capita']
label = df_ny_stats['borough']
g3= ax[1,0].pie(data, autopct='%1.1f%%', shadow=True, startangle=90, labels=label);
#fourth graph - bar plot highlighting persons per sq km at borough level
g4 = ax[1,1].bar(data=df_ny_stats, x='borough', height='pers_sqkm');
ax[0, 0].title.set_text('Average SALE PRICE on each BOROUGH')
ax[0, 1].title.set_text('Population')
ax[1, 0].title.set_text('GDP per capita')
ax[1, 1].title.set_text('Persons per sq km on each BOROUGH')
plt.suptitle('Comparison of GDP, Population, Sale Price', y=1.03, fontsize=14, fontweight='semibold');
GDP: the largest per-capita GDP is Manhattan's, followed by Queens, Brooklyn, Staten Island and the Bronx. Average sale price: Manhattan, Brooklyn, Queens, Staten Island and the Bronx, in descending order. Density: persons per sq km is highest in Manhattan, followed by Brooklyn, the Bronx and Queens. Population: Brooklyn, Queens, Manhattan, the Bronx and Staten Island, in descending order.
I wish to explore the neighborhoods of Queens, considering the facts mentioned in the problem description: lower real estate cost, high population and good GDP. This lowers the required investment while promising better returns. In the next section we explore the neighborhood data, followed by segmentation to find the top venues.
Google Maps' 'Search Nearby' option was used to get the center coordinates of each borough, and population data was captured from the Wikipedia page. We use the Foursquare API for neighborhood exploration, focusing on shopping venues, coffee shops, and entertainment venues. The two most populous boroughs are Brooklyn and Queens, and Queens combines high population with a low average sale price, so we apply neighborhood segmentation and clustering to the Queens data to prioritize the best locations based on the type and mix of venues available. This helps the investor choose the best place for a business investment.
I am using the Folium library to load the New York map.
address = 'New York City, NY'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))
The geographical coordinates of New York City are 40.7127281, -74.0060152.
empty_map = folium.Map(location=[latitude, longitude], zoom_start=10)
map_newyork = empty_map
df_ny_stats = df_ny_stats.rename(columns={'borough' : 'Borough'})
# create a choropleth map of New York, shading each borough by average sale price
with open('nyborough.geojson') as json_data:
    nyborough_data = json.load(json_data)

map_newyork.choropleth(
    geo_data=nyborough_data,
    data=df_ny_stats,
    columns=['Borough', 'sale_price'],
    key_on='feature.properties.BoroName',
    fill_color='YlOrRd',
    fill_opacity=0.5,
    line_opacity=0.2,
    legend_name='House Price in NewYork'
)
map_newyork
Next, we start utilizing the Foursquare API to explore the neighborhoods and segment them.
Define Foursquare Credentials and Version
Results are limited to 100 venues within a radius of 750 meters.
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 750 # radius in meters
A method, getNearbyVenues, is created so the code can be reused for any borough.
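The helper itself is not shown in this excerpt. A minimal sketch of what it presumably does, assuming the legacy Foursquare v2 `explore` endpoint and `CLIENT_ID` / `CLIENT_SECRET` / `VERSION` credentials defined in the credentials cell (those names, and the exact response layout, are assumptions rather than taken from the notebook):

```python
import pandas as pd

def parse_items(name, lat, lng, items):
    # flatten one neighborhood's Foursquare 'explore' items into tuples
    return [(name, lat, lng,
             item['venue']['name'],
             item['venue']['location']['lat'],
             item['venue']['location']['lng'],
             item['venue']['categories'][0]['name']) for item in items]

def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    import requests  # imported lazily; only needed when actually querying
    rows = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # CLIENT_ID, CLIENT_SECRET and VERSION are assumed to be defined
        # earlier in the notebook (Foursquare credentials section)
        url = ('https://api.foursquare.com/v2/venues/explore'
               '?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'
               ).format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, limit)
        items = requests.get(url).json()['response']['groups'][0]['items']
        rows.extend(parse_items(name, lat, lng, items))
    return pd.DataFrame(rows, columns=['Neighborhood', 'Neighborhood Latitude',
                                       'Neighborhood Longitude', 'Venue',
                                       'Venue Latitude', 'Venue Longitude',
                                       'Venue Category'])
```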
neighborhoods = neighborhoods[neighborhoods['Borough'] == 'Queens']
venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                         latitudes=neighborhoods['Latitude'],
                         longitudes=neighborhoods['Longitude'],
                         radius=radius)
Astoria Woodside Jackson Heights Elmhurst Howard Beach Corona Forest Hills Kew Gardens Richmond Hill Flushing Long Island City Sunnyside East Elmhurst Maspeth Ridgewood Glendale Rego Park Woodhaven Ozone Park South Ozone Park College Point Whitestone Bayside Auburndale Little Neck Douglaston Glen Oaks Bellerose Kew Gardens Hills Fresh Meadows Briarwood Jamaica Center Oakland Gardens Queens Village Hollis South Jamaica St. Albans Rochdale Springfield Gardens Cambria Heights Rosedale Far Rockaway Broad Channel Breezy Point Steinway Beechhurst Bay Terrace Edgemere Arverne Rockaway Beach Neponsit Murray Hill Floral Park Holliswood Jamaica Estates Queensboro Hill Hillcrest Ravenswood Lindenwood Laurelton Lefrak City Belle Harbor Rockaway Park Somerville Brookville Bellaire North Corona Forest Hills Gardens Jamaica Hills Utopia Pomonok Astoria Heights Hunters Point Sunnyside Gardens Blissville Roxbury Middle Village Malba Hammels Bayswater Queensbridge
print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))
There are 315 unique categories.
# one hot encoding
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
onehot['Neighborhood'] = venues['Neighborhood']
# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]
onehot.shape
(3785, 315)
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped.shape)
(81, 315)
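To make the `get_dummies` + `groupby('Neighborhood').mean()` step concrete, here is a toy illustration (the example data is made up): after grouping, each column holds the *fraction* of a neighborhood's venues that fall in that category.

```python
import pandas as pd

# made-up miniature of the venues frame
venues_toy = pd.DataFrame({'Neighborhood': ['A', 'A', 'A', 'B'],
                           'Venue Category': ['Cafe', 'Cafe', 'Bar', 'Bar']})

onehot_toy = pd.get_dummies(venues_toy[['Venue Category']], prefix='', prefix_sep='')
onehot_toy['Neighborhood'] = venues_toy['Neighborhood']

# per neighborhood, the mean of the 0/1 columns is the category frequency
freq = onehot_toy.groupby('Neighborhood').mean().reset_index()
# A -> Bar 1/3, Cafe 2/3; B -> Bar 1.0, Cafe 0.0
```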
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = grouped['Neighborhood']
for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head()
| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arverne | Surf Spot | Beach | Deli / Bodega | Sandwich Place | Bus Stop | Donut Shop | Furniture / Home Store | Café | Bed & Breakfast | Gas Station |
| 1 | Astoria | Bar | Greek Restaurant | Middle Eastern Restaurant | Pizza Place | Seafood Restaurant | Coffee Shop | Hookah Bar | Grocery Store | Sandwich Place | Indian Restaurant |
| 2 | Astoria Heights | Rental Car Location | Bakery | Bus Station | Pizza Place | Café | Italian Restaurant | Hotel | Laundromat | Baseball Field | Chinese Restaurant |
| 3 | Auburndale | Korean Restaurant | Cosmetics Shop | Pizza Place | Sushi Restaurant | Greek Restaurant | Sandwich Place | Bank | Pet Store | Pharmacy | Gym / Fitness Center |
| 4 | Bay Terrace | Clothing Store | Cosmetics Shop | Mobile Phone Shop | Women's Store | Kids Store | Lingerie Store | American Restaurant | Donut Shop | Shoe Store | Men's Store |
# set number of clusters
kclusters = 6

# drop the name column so only the venue frequencies feed the model
grouped_clustering = grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
array([1, 0, 0, 0, 0, 0, 2, 0, 3, 1])
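The notebook fixes `kclusters = 6` without justification; one common way to sanity-check k is the elbow method. A minimal sketch, not part of the original analysis, in which random data stands in for the `grouped_clustering` venue-frequency matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

# stand-in for grouped_clustering (81 neighborhoods x 20 categories)
rng = np.random.default_rng(0)
X = rng.random((81, 20))

# inertia (within-cluster sum of squares) for a range of k values
inertias = {k: KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(2, 10)}
# plot k vs. inertias and look for the bend ("elbow") in the curve
```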
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# merge with the neighborhoods frame to add latitude/longitude, then with the borough stats
neighborhoods_merged = pd.merge(neighborhoods, neighborhoods_venues_sorted, on='Neighborhood', how='inner')
neighborhoods_merged = neighborhoods_merged.merge(df_ny_stats, on='Borough')
neighborhoods_merged.head() # check the last columns!
| | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue | county | population | gdp_billions | gdp_capita | sqms | sqkm | pers_sqms | pers_sqkm | sale_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Queens | Astoria | 40.768509 | -73.915654 | 0 | Bar | Greek Restaurant | Middle Eastern Restaurant | Pizza Place | Seafood Restaurant | Coffee Shop | Hookah Bar | Grocery Store | Sandwich Place | Indian Restaurant | Queens | 2253858 | 93.31 | 41400 | 108.53 | 281.09 | 20767 | 8018 | 709884.787686 |
| 1 | Queens | Woodside | 40.746349 | -73.901842 | 0 | Thai Restaurant | Pizza Place | Bar | Grocery Store | Bakery | Latin American Restaurant | American Restaurant | Pub | Filipino Restaurant | Donut Shop | Queens | 2253858 | 93.31 | 41400 | 108.53 | 281.09 | 20767 | 8018 | 709884.787686 |
| 2 | Queens | Jackson Heights | 40.751981 | -73.882821 | 0 | Latin American Restaurant | South American Restaurant | Bakery | Mexican Restaurant | Peruvian Restaurant | Pizza Place | Thai Restaurant | Grocery Store | Coffee Shop | Donut Shop | Queens | 2253858 | 93.31 | 41400 | 108.53 | 281.09 | 20767 | 8018 | 709884.787686 |
| 3 | Queens | Elmhurst | 40.744049 | -73.881656 | 0 | Thai Restaurant | Mexican Restaurant | Chinese Restaurant | Bakery | South American Restaurant | Latin American Restaurant | Grocery Store | Supermarket | Bubble Tea Shop | Snack Place | Queens | 2253858 | 93.31 | 41400 | 108.53 | 281.09 | 20767 | 8018 | 709884.787686 |
| 4 | Queens | Howard Beach | 40.654225 | -73.838138 | 0 | Italian Restaurant | Bagel Shop | Pharmacy | Fast Food Restaurant | Sandwich Place | Ice Cream Shop | Park | Sushi Restaurant | Other Nightlife | Bus Station | Queens | 2253858 | 93.31 | 41400 | 108.53 | 281.09 | 20767 | 8018 | 709884.787686 |
# top-10 '1st Most Common Venue' counts per cluster, stacked into one frame
cluster_frames = []
for label in range(kclusters):
    df_c = (neighborhoods_merged[neighborhoods_merged['Cluster Labels'] == label]
            [['Neighborhood', '1st Most Common Venue']]
            .groupby('1st Most Common Venue').count()
            .sort_values(by=['Neighborhood'], ascending=False)
            .head(10).reset_index())
    df_c['Cluster'] = label + 1
    cluster_frames.append(df_c)

(neighborhoods_c1, neighborhoods_c2, neighborhoods_c3,
 neighborhoods_c4, neighborhoods_c5, neighborhoods_c6) = cluster_frames

df_final = pd.concat(cluster_frames).sort_values(by=['Cluster']).reset_index(drop=True)
A map is plotted to visualize the neighborhoods by cluster; a choropleth layer highlights each borough by its average house sale price.
from folium import ColorMap, LinearColormap, StepColormap

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

neighborhoods_merged['text'] = neighborhoods_merged['Borough'] + neighborhoods_merged['Neighborhood'] + '<br>'
# add the choropleth layer shading each borough by average house price
with open('nyborough.geojson') as json_data:
    nyborough_data = json.load(json_data)

map_clusters.choropleth(
    geo_data=nyborough_data,
    data=df_ny_stats,
    columns=['Borough', 'sale_price'],
    key_on='feature.properties.BoroName',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='House Price in NewYork'
)
# add a circle marker per neighborhood, colored by cluster
for cluster, lat, lon, poi, borough, avg_hp in zip(
        neighborhoods_merged['Cluster Labels'],
        neighborhoods_merged['Latitude'],
        neighborhoods_merged['Longitude'],
        neighborhoods_merged['Neighborhood'],
        neighborhoods_merged['Borough'],
        neighborhoods_merged['sale_price']):
    label = folium.Popup('cluster ' + str(cluster) + ' - ' + str(poi) + ' : ' + str(int(avg_hp)), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # cluster labels are 0-based, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
For cluster visualization, bar graphs are developed to show which venue categories dominate each cluster.
cluster_frames = [neighborhoods_c1, neighborhoods_c2, neighborhoods_c3,
                  neighborhoods_c4, neighborhoods_c5, neighborhoods_c6]

plt.figure(figsize=(8,12))
plt.suptitle('Cluster Analysis', y=1.03, fontsize=16, fontweight='semibold')
# one bar chart per cluster: counts of each cluster's most common venues
for i, df_c in enumerate(cluster_frames):
    plt.subplot(3, 2, i + 1)
    g = sns.barplot(data=df_c, x="1st Most Common Venue", y='Neighborhood')
    g.set_title('Cluster {} ({})'.format(i + 1, i), y=1.03, fontsize=14, fontweight='semibold')
    g.set_xlabel('Venues')
    g.set_ylabel('No. of occurrences')
    plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
The clusters are categorized by their dominant venue types, as shown above. The data exploration results help us understand population density, GDP and housing prices, and so compare and decide which borough to choose for the clustering analysis. From the clustering results, an investor can pick the cluster matching the type of business they plan to invest in, and the map helps in selecting the area in which the business can be built.
As mentioned in the introduction, New York is a populous US city with a very large GDP. The city has five boroughs, and for this project I considered Queens based on its low real estate prices combined with good population density and GDP. The neighborhood data was filtered to the Queens borough, and the venues for these neighborhoods were collected using the Foursquare API. I then kept only the top 10 venue categories per neighborhood, based on the frequency of each category within the neighborhood. K-means clustering was used to group the neighborhoods into 6 clusters, and the cluster-level venue mixes help define the categories; using this categorization, a map was generated displaying each neighborhood and its cluster. The boroughs are also shaded by their average housing price, to show visually where each borough stands.
Manhattan is the most expensive borough in New York, with a mid-range population compared with its peers; it also contributes roughly 70% of the city's GDP. Business investors with a high budget can opt for Manhattan for its economic strength. Queens, with its high population and low average housing cost, could be a better investment option. Clusters 1 and 4, with a high number of restaurant venues, should be good targets for investing in related businesses that support restaurants and bars, e.g. food supplies.
Further analysis can be carried out for other boroughs and cities depending on investors' needs, and the approach could be extended to help government institutions improve planning efficiency.